-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HTML API: Add normalization functions. #7331
Conversation
The HTML Processor understands HTML regardless of how it's written, but many other functions are unable to do so. There are all sorts of syntax peculiarities and semantics that would be helpful to eliminate using the knowledge contained in the HTML Processor. This patch introduces `WP_HTML_Processor::normalize( $html )` as a method which takes a fragment of HTML as input and then returns a serialized version of the input, "cleaning it up" by balancing all tags, providing all missing optional tags, re-encoding all text, removing all duplicate attributes, and double-quote-escaping all attribute values. Core-62036
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the Core Committers: Use this line as a base for the props when committing in SVN:
To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
Test using WordPress PlaygroundThe changes in this pull request can previewed and tested using a WordPress Playground instance. WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser. Some things to be aware of
For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice: force_balance_tags()
: The Next Generation
*/ | ||
public function serialize(): ?string { | ||
if ( WP_HTML_Tag_Processor::STATE_READY !== $this->parser_state ) { | ||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this make sense to throw an exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried to avoid throwing exceptions in use code. Tell me more about the value of potentially crashing vs. returning null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I guess to give more information about why it is returning null
. Maybe _doing_it_wrong()
then would be better? It would be helpful to get feedback in code for what is documented:
* This differs from {@see WP_HTML_Processor::normalize} in that it starts with
* a specific HTML Processor, which _must_ not have already started scanning;
* it must be in the initial ready state and will be in the completed state once
* serialization is complete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah okay I see now. another thought I had was resetting to the beginning, parsing, and returning to the previous location, which involves double-parsing if already mid-way through a document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've called wp_trigger_error()
in these cases.
while ( $this->next_token() ) { | ||
$token_type = $this->get_token_type(); | ||
|
||
switch ( $token_type ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah good catch: it should also serialize the PI node tag name, which would match what you wrote. looks like this needs a review of all of the invalid comment syntax
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these should all be updated now. if something lingers I'd like to fix it, but ultimately if we botch an invalid comment, I'm guessing it's not the end of the world.
these will go into test cases.
} | ||
|
||
if ( ! $in_html && $this->has_self_closing_flag() ) { | ||
$html .= '/'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While not required, it seems a space is usually added here in the wild, right? (e.g. Prettier does this)
$html .= '/'; | |
$html .= ' /'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a good thought. with double-quoted attributes it's not relevant, but with unquoted attribute values it becomes relevant. we don't need that since we control quoting. maybe it's best to add it in anyway for the same of other tools.
if ( null !== $this->get_last_error() ) { | ||
return null; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly here, it would be helpful to know why it returned null
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a tougher question because it would conflate the basic ?string
return value. practically I think this can only occur if the HTML is unsupported (in which case we really shouldn't return any processed string) or we've run out of bookmarks (which should be unrealistically rare - and that reminds me, I found 2500 bookmarks sufficient to parse everything in my set of ~300k HTML documents, and I intend on upping the default value to support that for 6.7).
suppose you know why this failed: what would you do in response?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could also use _doing_it_wrong()
here too to communicate that information, I suppose. Or rather, wp_trigger_error()
would be the more relevant function. If I knew why it failed then I wouldn't have to figure out why it failed. True it probably wouldn't impact the result in the end, but for debugging it would be useful.
Looking at where last_error
is set, it seems to always coincide with throwing an exception anyway. So in practice would this if
statement ever be entered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the exceptions thrown internally are caught and shut down parsing, but does not crash. unsupported content exceptions are returned via get_unsupported_exception()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've called wp_trigger_error()
in these cases.
Co-authored-by: Weston Ruter <westonruter@git.wordpress.org>
If code later in the processing pipeline adds unquoted attributes and doesn't add the requisite space following that, then another parser might find that the solidus is part of the attribute value instead of serving as a self-closing flag. Co-authored-by: Weston Ruter <westonruter@git.wordpress.org>
Co-authored-by: Weston Ruter <westonruter@git.wordpress.org>
Co-authored-by: Weston Ruter <westonruter@git.wordpress.org>
Co-authored-by: Weston Ruter <westonruter@git.wordpress.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pretty exciting, I'd like to start adding tests for it.
I just added it to the html api debugger when supported.
I'd love to start adding tests for this. One good test will be idempotency, where after an initial normalization, subsequent normalizations will be identical.
This mentions null bytes here specifically:
Text will be re-encoded, null bytes handled, and invalid UTF-8 replaced with U+FFFD.
I think that's working correctly in text. Should it also be handled in tag names, attribute names, and attribute values?
Input (null bytes replaces for clarity)
<div␀-nb nb-att-␀-="nb-val-␀-">
Normalized output:
<div␀-nb nb-att-␀-="nb-val-␀-"></div␀-nb>
Expected:
<div�-nb nb-att-�-="nb-val-�-"></div�-nb>
There are some known issues from HTML5lib tests similar to the PI problems mentioned here: #7331 (comment) wordpress-develop/tests/phpunit/tests/html-api/wpHtmlProcessorHtml5lib.php Lines 28 to 30 in 4712210
There's no good way to read the comment under some circumstances and something like a wordpress-develop/tests/phpunit/data/html5lib-tests/tree-construction/comments01.dat Line 156 in 4712210
wordpress-develop/tests/phpunit/data/html5lib-tests/tree-construction/comments01.dat Line 170 in 4712210
wordpress-develop/tests/phpunit/data/html5lib-tests/tree-construction/html5test-com.dat Line 130 in 4712210
Each of these does not satisfy the PI constraint (missing There are a couple of cases like this, I think they're all CORRECTION: I've edited it, it was initially incorrect. The bogus comments starting with
|
We'd talked about a method to really inspect different types of comment text content. I've proposed a method in #7342. That would be helpful here. |
c5f1924
to
92558b9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is nice and it seems like it's in a good place. I'm happy to see it getting tests.
I'd like it if null bytes were normalized in more places (tag names, attribute names and values) before this lands.
HTML often appears in ways that are unexpected. It may be missing implicit tags, may have unquoted, single-quoted, or double-quoted attributes, may contain duplicate attributes, may contain unescaped text content, or any number of other possible invalid constructions. The HTML API understands all fo these inputs, but downline parsers may not, and HTML snippets which are safe on their own may introduce problems when joined with other HTML snippets. This patch introduces the `serialize()` method on the HTML Processor, which prints a fully-normative HTML output, eliminating invalid markup along the way. It produces a string which contains every missing tag, double-quoted attributes, and no duplicates. A `normalize()` static method on the HTML Processor provides a convenient wrapper for constructing a fragment parser and immediately serializing. Subclasses relying on the `serialize_token()` method may perform structural HTML modifications with as much security as the upcoming `\Dom\HTMLDocument()` parser will, though these are not able to provide the full safety that will eventually appear with `set_inner_html()`. Further work may explore serializing to XML (which involves a number of other important transformations) and adding constraints to serialization (such as only allowing inline/flow/formatting elements and text). Developed in #7331 Discussed in https://core.trac.wordpress.org/ticket/62036 Props dmsnell, jonsurrell, westonruter. Fixes #62036. git-svn-id: https://develop.svn.wordpress.org/trunk@59076 602fd350-edb4-49c9-b593-d223f7449a82
HTML often appears in ways that are unexpected. It may be missing implicit tags, may have unquoted, single-quoted, or double-quoted attributes, may contain duplicate attributes, may contain unescaped text content, or any number of other possible invalid constructions. The HTML API understands all fo these inputs, but downline parsers may not, and HTML snippets which are safe on their own may introduce problems when joined with other HTML snippets. This patch introduces the `serialize()` method on the HTML Processor, which prints a fully-normative HTML output, eliminating invalid markup along the way. It produces a string which contains every missing tag, double-quoted attributes, and no duplicates. A `normalize()` static method on the HTML Processor provides a convenient wrapper for constructing a fragment parser and immediately serializing. Subclasses relying on the `serialize_token()` method may perform structural HTML modifications with as much security as the upcoming `\Dom\HTMLDocument()` parser will, though these are not able to provide the full safety that will eventually appear with `set_inner_html()`. Further work may explore serializing to XML (which involves a number of other important transformations) and adding constraints to serialization (such as only allowing inline/flow/formatting elements and text). Developed in WordPress/wordpress-develop#7331 Discussed in https://core.trac.wordpress.org/ticket/62036 Props dmsnell, jonsurrell, westonruter. Fixes #62036. Built from https://develop.svn.wordpress.org/trunk@59076 git-svn-id: http://core.svn.wordpress.org/trunk@58472 1a063a9b-81f0-0310-95a4-ce76da25c4cd
HTML often appears in ways that are unexpected. It may be missing implicit tags, may have unquoted, single-quoted, or double-quoted attributes, may contain duplicate attributes, may contain unescaped text content, or any number of other possible invalid constructions. The HTML API understands all fo these inputs, but downline parsers may not, and HTML snippets which are safe on their own may introduce problems when joined with other HTML snippets. This patch introduces the `serialize()` method on the HTML Processor, which prints a fully-normative HTML output, eliminating invalid markup along the way. It produces a string which contains every missing tag, double-quoted attributes, and no duplicates. A `normalize()` static method on the HTML Processor provides a convenient wrapper for constructing a fragment parser and immediately serializing. Subclasses relying on the `serialize_token()` method may perform structural HTML modifications with as much security as the upcoming `\Dom\HTMLDocument()` parser will, though these are not able to provide the full safety that will eventually appear with `set_inner_html()`. Further work may explore serializing to XML (which involves a number of other important transformations) and adding constraints to serialization (such as only allowing inline/flow/formatting elements and text). Developed in WordPress/wordpress-develop#7331 Discussed in https://core.trac.wordpress.org/ticket/62036 Props dmsnell, jonsurrell, westonruter. Fixes #62036. Built from https://develop.svn.wordpress.org/trunk@59076 git-svn-id: https://core.svn.wordpress.org/trunk@58472 1a063a9b-81f0-0310-95a4-ce76da25c4cd
Trac ticket: Core-62036.
See also dmsnell#20
See also dmsnell#21
The HTML Processor understands HTML regardless of how it's written, but
many other functions are unable to do so. There are all sorts of syntax
peculiarities and semantics that would be helpful to eliminate using the
knowledge contained in the HTML Processor.
This patch introduces
WP_HTML_Processor::normalize( $html )
as amethod which takes a fragment of HTML as input and then returns a
serialized version of the input, "cleaning it up" by balancing all
tags, providing all missing optional tags, re-encoding all text,
removing all duplicate attributes, and double-quote-escaping all
attribute values.